Abstract



In a previous publication we proposed discrete global optimization as a method to train a
strong binary classifier constructed as a thresholded sum over weak classifiers. Our motivation
was to cast the training of a classifier into a format amenable to solution by the quantum
adiabatic algorithm. Applying adiabatic quantum computing (AQC) promises to yield solutions
that are superior to those which can be achieved with classical heuristic solvers. Interestingly,
we found that by using heuristic solvers to obtain approximate solutions we could already gain
an advantage over the standard method AdaBoost. In this communication we generalize the
baseline method to large scale classifier training. By large scale we mean that either the
cardinality of the dictionary of candidate weak classifiers or the number of weak learners used in
the strong classifier exceeds the number of variables that can be handled effectively in a
single global optimization. For such situations we propose an iterative and piecewise approach
in which a subset of weak classifiers is selected in each iteration via global optimization. The
strong classifier is then constructed by concatenating the subsets of weak classifiers. We show
in numerical studies that the generalized method again successfully competes with AdaBoost.
We also provide theoretical arguments as to why the proposed optimization method, which
not only minimizes the empirical loss but also adds L0-norm regularization, is superior to
versions of boosting that only minimize the empirical loss. By conducting a Quantum Monte Carlo
simulation we gather evidence that the quantum adiabatic algorithm is able to handle a generic
training problem efficiently.
1 Baseline System



In [NDRM08] we study a binary classifier of the form 

$$ y = H(x) = \mathrm{sign}\left( \sum_{i=1}^{N} w_i h_i(x) \right), \tag{1} $$

where $x \in \mathbb{R}^M$ are the input patterns to be classified, $y \in \{-1, +1\}$ is the output of the classifier,
the $h_i : x \mapsto \{-1, +1\}$ are so-called weak classifiers or feature detectors, and the $w_i \in \{0, 1\}$ are
a set of weights to be optimized during training. $H(x)$ is known as a strong classifier.

Training, i.e. the process of choosing the weights $w_i$, proceeds by simultaneously minimizing
two terms. The first term, called the loss $L(w)$, measures the error over a set of $S$ training examples
$\{(x_s, y_s) \mid s = 1, \ldots, S\}$. We choose least squares as the loss function. The second term, known
as regularization $R(w)$, ensures that the classifier does not become too complex. We employ a
regularization term based on the L0-norm, $\|w\|_0$. This term encourages the strong classifier to be
built with as few weak classifiers as possible while maintaining a low training error. Thus, training
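is accomplished by solving the following discrete optimization problem:

$$ w^{opt} = \arg\min_{w} \left( \sum_{s=1}^{S} \left( \frac{1}{N} \sum_{i=1}^{N} w_i h_i(x_s) - y_s \right)^2 + \lambda \|w\|_0 \right). \tag{2} $$
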
Note that in our formulation, the weights are binary and not positive real numbers as in AdaBoost.
Even though discrete optimization could be applied to any bit depth representing the weights, we
found that a small bit depth is often sufficient [NDRM08]. Here we only deal with the simplest case
in which the weights are chosen to be binary.
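As a minimal sketch of how such a training objective maps to a quadratic unconstrained binary program, the Python code below expands the squared loss into pairwise and linear coefficients and solves a tiny instance exhaustively. It assumes the objective is the squared error of the 1/N-scaled score plus $\lambda \|w\|_0$; the function names and the random toy data are illustrative only, and exhaustive search merely stands in for a heuristic or quantum solver.

```python
import itertools
import numpy as np

def build_qubo(H, y, lam):
    """H: (S, N) weak-classifier outputs in {-1, +1}; y: (S,) labels in {-1, +1}."""
    N = H.shape[1]
    quad = (H.T @ H) / N**2             # coefficient of w_i * w_j
    lin = lam - (2.0 / N) * (H.T @ y)   # coefficient of w_i
    return quad, lin

def qubo_energy(w, quad, lin):
    return w @ quad @ w + lin @ w

def direct_objective(w, H, y, lam):
    """Squared error of the 1/N-scaled score plus lam times the number of used classifiers."""
    N = H.shape[1]
    return np.sum((H @ w / N - y) ** 2) + lam * np.sum(w)

def brute_force(quad, lin):
    """Exhaustive minimization; only feasible for very small N."""
    N = len(lin)
    best = min(itertools.product([0, 1], repeat=N),
               key=lambda bits: qubo_energy(np.array(bits), quad, lin))
    return np.array(best)

# Toy usage with random (hypothetical) weak-classifier outputs.
rng = np.random.default_rng(0)
H_demo = rng.choice([-1, 1], size=(20, 6))
y_demo = rng.choice([-1, 1], size=20)
quad_demo, lin_demo = build_qubo(H_demo, y_demo, lam=0.1)
w_demo = brute_force(quad_demo, lin_demo)
```

The QUBO energy and the direct objective differ only by the constant $\sum_s y_s^2 = S$, so both have the same minimizer.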



2 Comparison of the baseline algorithm to AdaBoost 

In the case of a finite dictionary of weak classifiers $\{h_i(x) \mid i = 1, \ldots, N\}$, AdaBoost can be seen as a
greedy algorithm that minimizes the exponential loss [Zha04],

$$ \alpha^{opt} = \arg\min_{\alpha} \left( \sum_{s=1}^{S} \exp\left( -y_s \sum_{i=1}^{N} \alpha_i h_i(x_s) \right) \right), \tag{3} $$

with $\alpha_i \in \mathbb{R}^+$. There are two differences between the objective of our algorithm (Eqn. 2) and the
one employed by AdaBoost. The first is that we added L0-norm regularization. Second, we employ
a quadratic loss function, while AdaBoost works with the exponential loss.
It can easily be shown that including L0-norm regularization in the objective in Eqn. (2) leads
to improved generalization error as compared to using the quadratic loss only. The proof goes as
follows. An upper bound for the Vapnik-Chervonenkis dimension of a strong classifier $H$ of the
form $H(x) = \sum_{t=1}^{T} h_t(x)$ is given by

$$ VC_H = 2 (VC_{\{h_t\}} + 1)(T + 1) \log_2(e(T + 1)), \tag{4} $$

where $VC_{\{h_t\}}$ is the VC dimension of the dictionary of weak classifiers [FS95]. The strong
classifier's generalization error $\mathrm{Error}_{test}$ therefore has an upper bound given by [VC71]



$$ \mathrm{Error}_{test} \le \mathrm{Error}_{train} + \sqrt{\;\ldots\;}. \tag{5} $$

It is apparent that a more compact strong classifier that achieves a given training error $\mathrm{Error}_{train}$
with a smaller number $T$ of weak classifiers (hence, with a smaller VC dimension $VC_H$) comes




Figure 1: AdaBoost applied to a simple classification task. A shows the data, a separable set
consisting of a two-dimensional cluster of positive examples (blue) surrounded by negative ones (red).
B shows the random division into training (saturated colors) and test data (light colors). The
dictionary of weak classifiers is constructed of axis-parallel one-dimensional hyperplanes. C shows the
optimal classifier for this situation, which employs four weak classifiers to partition the input space
into a positive and a negative area. The lower row shows partitions generated by AdaBoost after 10,
20, and 640 iterations. The configuration at T = 640 is the asymptotic configuration that does not
change anymore in subsequent training rounds. The "breakout regions" outside the bounding box of
the positive cluster occur in areas in which the training set does not contain negative examples. This
problem becomes more severe for higher dimensional data. Due to AdaBoost's greedy approach,
the optimal configuration is not found despite the fact that the weak classifiers necessary to construct
the ideal bounding box are generated. In fact, AdaBoost fails to learn higher dimensional versions
of this problem altogether, with error rates approaching 50%. See section 6 for a discussion of how
global optimization based learning can handle this data set.
with a guarantee for a lower generalization error. Looking at the optimization problem in Eqn. 2,
one can see that if the regularization strength $\lambda$ is chosen weak enough, i.e. $\lambda < \frac{2}{N} + \frac{1}{N^2}$, then the
effect of the regularization is merely to thin out the strong classifier. One arrives at the condition
for $\lambda$ by demanding that the reduction of the regularization term $\Delta R(w)$ that can be obtained by
switching one $w_i$ to zero is smaller than the smallest associated increase in the loss term $\Delta L(w)$ that
comes from incorrectly labeling a training example. This condition guarantees that weak classifiers
are not eliminated at the expense of a higher training error. Therefore the regularization will only
keep a minimal set of components, those which are needed to achieve the minimal training error that
was obtained when using the loss term only. In this regime, the VC bound of the resulting strong
classifier is lower than or equal to the VC bound of a classifier trained with no regularization.
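To make the dependence of the bound in Eqn. (4) on the number $T$ of weak classifiers concrete, the following short Python sketch evaluates it for a few classifier sizes. The dictionary VC dimension of 2 used in the example is a placeholder chosen for illustration, not a value from our experiments.

```python
import math

def vc_bound_strong(vc_dict, T):
    """Upper bound of Eqn. (4) on the VC dimension of a sum of T weak classifiers."""
    return 2 * (vc_dict + 1) * (T + 1) * math.log2(math.e * (T + 1))

# The bound grows monotonically with T, so a thinner classifier has a smaller bound.
for T in (4, 10, 100):
    print(T, round(vc_bound_strong(2, T), 1))
```

Since the bound increases with $T$, removing weak classifiers without raising the training error can only tighten it.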

AdaBoost contains no explicit regularization term and it can easily happen that the classifier 
uses a richer set of weak classifiers than needed to achieve the minimal training error, which in turn 
leads to degraded generalization. Fig. 1 illustrates this fact. 

In practice we do not operate in the weak $\lambda$ regime but rather determine the regularization
strength $\lambda$ by using a validation set. We measure the performance of the classifier for different
values of $\lambda$ on a validation set and then choose the one with the minimal validation error. In this
regime, the optimization performs a trade-off and accepts a higher empirical loss if the classifier can
be kept more compact. In other words it may choose to misclassify training examples if it can keep
the classifier simpler. This leads to increased robustness in the case of noisy data, and indeed we
observe the most significant gains over AdaBoost for noisy data sets when the Bayes error is high.
The fact that boosting in its standard formulation with convex loss functions and no regularization
is not robust against label noise has drawn attention recently [LS08][Fre09].

The second difference to our baseline system, namely that we employ quadratic loss while
AdaBoost works with exponential loss, is of smaller importance. In fact, the discussion above about
the role of the regularization term would not have changed if we were to choose exponential rather
than square loss. The literature seems to agree that the use of exponential loss in AdaBoost is not
essential and that other loss functions could be employed to yield classifiers with similar performance
[FHT98][Wyn02]. From a statistical perspective, quadratic loss is satisfying since a classifier that
minimizes the quadratic loss is consistent. With an increasing number of training samples it will
asymptotically approach a Bayes classifier, i.e. the classifier with the smallest possible error [Zha04].

3 Generalization to large scale classifiers 

The baseline system assumes a fixed dictionary containing a number of weak classifiers small 
enough, so that all weight variables can be considered in a single global optimization. This ap- 
proach needs to be modified if the goal is to train a large scale classifier. Large scale here means 
that at least one of two conditions is fulfilled: 

1. The dictionary contains more weak classifiers than can be considered in a single global
optimization.

2. The final strong classifier consists of a number of weak classifiers that exceeds the number of 
variables that can be handled in a single global optimization. 

Let us take a look at typical problem sizes. The state-of-the-art heuristic solver ILOG CPLEX can
obtain good solutions for quadratic binary programmes with up to 1000 variables, depending on the
coefficient matrix. The quantum hardware solvers manufactured by D-Wave currently can handle
128-variable problems. In order to train a strong classifier we often sift through millions of features.
Moreover, dictionaries of weak learners often depend on a set of continuous parameters such
as thresholds, which means that their cardinality is infinite. We estimate that typical classifiers
employed in vision-based products today use thousands of weak learners. Therefore it is not possible to
determine all weights in a single global optimization, but rather it is necessary to break the problem
into smaller chunks.

Let T designate the size of the final strong classifier and Q the number of variables that we can 
handle in a single optimization. Q may be determined by the number of available qubits, or if we 
employ classical heuristic solvers such as ILOG CPLEX or tabu search [Pal04], then Q designates 
a problem size for which we can hope to obtain a solution of reasonable quality. We are going to 
consider two cases. The first with T < Q and the second with T > Q. 

We first describe the "inner loop" algorithm we suggest if the number of variables we can handle 
exceeds the number of weak learners needed to construct the strong classifier. 

Algorithm 1 T < Q (Inner Loop)

Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier

 1: Initialize weight distribution $d_{inner}$ over training samples as uniform distribution $\forall s : d_{inner}(s) = \frac{1}{S}$
 2: Set $T_{inner} \leftarrow 0$
 3: repeat
 4:    Select the $Q - T_{inner}$ weak classifiers $h_i$ from the dictionary that have the smallest weighted
       training error rates
 5:    for $\lambda = \lambda_{min}$ to $\lambda_{max}$ do
 6:       Run the optimization $w^{opt} = \arg\min_{w} \left( \sum_{s=1}^{S} \left( \frac{1}{Q} \sum_{i=1}^{Q} w_i h_i(x_s) - y_s \right)^2 + \lambda \|w\|_0 \right)$
 7:       Set $T_{inner} = \|w\|_0$
 8:       Construct strong classifier $H(x) = \mathrm{sign}\left( \sum_{t=1}^{T_{inner}} h_t(x) \right)$ summing up the weak classifiers
          for which $w_i = 1$
 9:       Measure validation error $\mathrm{Error}_{val}$ of strong classifier on unweighted validation set
10:    end for
11:    Keep $T_{inner}$, $H(x)$ and $\mathrm{Error}_{val}$ for the $\lambda$ run that yielded the smallest validation error
12:    Update weights $d_{inner}(s) = d_{inner}(s) \left( \frac{1}{T_{inner}} \sum_{t=1}^{T_{inner}} h_t(x_s) - y_s \right)^2$
13:    Normalize $d_{inner}(s) = \frac{d_{inner}(s)}{\sum_{s=1}^{S} d_{inner}(s)}$
14: until validation error $\mathrm{Error}_{val}$ stops decreasing



A way to think about this algorithm is to see it as an enrichment process. In the first round, the
algorithm selects those $T_{inner}$ weak classifiers out of a subset of $Q$ that produce the optimal validation
error. The subset of $Q$ weak classifiers has been preselected from a dictionary with a cardinality
possibly much larger than $Q$. In the next step the algorithm fills the $Q - T_{inner}$ empty slots in the
solver with the best weak classifiers drawn from a modified dictionary that was adapted by taking
into account for which samples the strong classifier constructed in the first round is already good and
where it still makes errors. This is the boosting idea. Under the assumption that the solver always
finds the global minimum, it is guaranteed that for a given $\lambda$ the solutions found in the subsequent
round will have lower or equal objective value, i.e. they achieve a lower loss or they represent a more
compact strong classifier. The fact that the algorithm always considers groups of $Q$ weak classifiers
simultaneously, rather than incrementing the strong classifier one by one, and then tries to find the
smallest subset that still produces a low training error allows it to find optimal configurations more
efficiently.
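One round of the inner loop can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 under several assumptions: the weak-classifier outputs are precomputed as matrices, an exhaustive search (feasible only for small $Q$) replaces the heuristic or quantum solver, a score of zero is classified as +1, and the weights fall back to uniform if the update drives all of them to zero. All function names are hypothetical.

```python
import itertools
import numpy as np

def solve_subset(Hq, y, lam):
    """Exhaustively minimize sum_s ((1/Q) sum_i w_i h_i(x_s) - y_s)^2 + lam * ||w||_0."""
    Q = Hq.shape[1]
    best_w, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=Q):
        w = np.array(bits)
        val = np.sum((Hq @ w / Q - y) ** 2) + lam * w.sum()
        if val < best_val:
            best_w, best_val = w, val
    return best_w

def error_rate(H_sel, y):
    """Error of sign(sum of selected columns); a zero score counts as +1 (assumption)."""
    pred = np.where(H_sel.sum(axis=1) >= 0, 1, -1)
    return np.mean(pred != y)

def inner_round(H_tr, y_tr, H_va, y_va, d, kept, Q, lambdas):
    """Steps 4-13 of the inner loop: returns (selected indices, validation error, new weights)."""
    # Step 4: fill the free slots with the candidates of lowest weighted training error.
    werr = ((H_tr != y_tr[:, None]) * d[:, None]).sum(axis=0)
    free = [i for i in np.argsort(werr) if i not in kept][: Q - len(kept)]
    cand = np.array(list(kept) + free, dtype=int)
    best_sel, best_err = cand[:0], np.inf
    for lam in lambdas:                                  # steps 5-10
        w = solve_subset(H_tr[:, cand], y_tr, lam)
        sel = cand[w == 1]
        err = error_rate(H_va[:, sel], y_va)
        if err < best_err:                               # step 11
            best_sel, best_err = sel, err
    # Steps 12-13: reweight by the squared deviation of the normalized score, then normalize.
    T = max(len(best_sel), 1)
    d_new = d * (H_tr[:, best_sel].sum(axis=1) / T - y_tr) ** 2
    total = d_new.sum()
    d_new = d_new / total if total > 0 else np.full_like(d, 1.0 / len(d))
    return best_sel, best_err, d_new
```

Calling `inner_round` repeatedly, passing the previously selected indices as `kept`, until the validation error stops decreasing corresponds to the repeat-until loop of the algorithm.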

If the validation error cannot be decreased any further using the inner loop, one may conclude 
that more weak classifiers are needed to construct the strong one. In this case the "outer loop" 
algorithm "freezes" the classifier obtained so far and adds another partial classifier trained again by 
the inner loop. 



Algorithm 2 T > Q (Outer Loop)

Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier

1: Initialize weight distribution $d_{outer}$ over training samples as uniform distribution $\forall s : d_{outer}(s) = \frac{1}{S}$
2: Set $T_{outer} \leftarrow 0$
3: repeat
4:    Run Algorithm 1 with $d_{inner}$ initialized from current $d_{outer}$ and using an objective function that
      takes into account the current $H(x)$:
      $w^{opt} = \arg\min_{w} \left( \sum_{s=1}^{S} \left( \frac{1}{T_{outer} + Q} \left( \sum_{t=1}^{T_{outer}} h_t(x_s) + \sum_{i=1}^{Q} w_i h_i(x_s) \right) - y_s \right)^2 + \lambda \|w\|_0 \right)$
5:    Set $H(x) = \mathrm{sign}\left( \sum_{t=1}^{T_{outer}} h_t(x) + \sum_{t=T_{outer}+1}^{T_{outer}+T_{inner}} h_t(x) \right)$ adding those
      weak classifiers for which $w_i = 1$
6:    Set $T_{outer} \leftarrow T_{outer} + T_{inner}$
7:    Update weights $d_{outer}(s) = d_{outer}(s) \left( \frac{1}{T_{outer}} \sum_{t=1}^{T_{outer}} h_t(x_s) - y_s \right)^2$
8:    Normalize $d_{outer}(s) = \frac{d_{outer}(s)}{\sum_{s=1}^{S} d_{outer}(s)}$
9: until validation error $\mathrm{Error}_{val}$ stops decreasing
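The modified objective used in step 4 of the outer loop remains quadratic in the binary weights, because the frozen classifier only contributes a fixed offset per training sample. The Python sketch below illustrates this under the assumption that the frozen part enters as the unthresholded sum of its weak classifiers; the names `outer_objective` and `outer_qubo` are hypothetical.

```python
import numpy as np

def outer_objective(w, H_new, frozen_score, T_outer, y, lam):
    """Objective of step 4: frozen_score[s] is the sum of the T_outer frozen weak classifiers."""
    Q = H_new.shape[1]
    score = (frozen_score + H_new @ w) / (T_outer + Q)
    return np.sum((score - y) ** 2) + lam * np.sum(w)

def outer_qubo(H_new, frozen_score, T_outer, y, lam):
    """Quadratic, linear and constant parts of the same objective."""
    c = T_outer + H_new.shape[1]
    r = frozen_score / c - y              # residual left by the frozen classifier
    quad = (H_new.T @ H_new) / c**2       # coefficient of w_i * w_j
    lin = lam + (2.0 / c) * (H_new.T @ r) # coefficient of w_i
    const = np.sum(r ** 2)
    return quad, lin, const
```

With `T_outer = 0` and a zero frozen score this reduces to the inner-loop objective, so the same solver interface can serve both loops.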



4 Implementation details and performance measurements 

To assess the performance of binary classifiers of the form (1) trained by applying the outer loop
algorithm, we measured their performance on synthetic and natural data sets. The synthetic test
data consisted of 30-dimensional input vectors generated by sampling from
$P(x, y) = \frac{1}{2} \delta(y - 1) N(x \mid \mu_+, I) + \frac{1}{2} \delta(y + 1) N(x \mid \mu_-, I)$, where $N(x \mid \mu, \Sigma)$ is a spherical Gaussian having mean $\mu$
and covariance $\Sigma$. An overlap coefficient determines the separation of the two Gaussians. See
[NDRM08] for details. The natural data consists of two sets of 30- and 96-dimensional vectors of
Gabor wavelet amplitudes extracted at eye locations in images showing faces. The input vectors are
normalized using the L2-norm, i.e. we have $\|x\|_2 = 1$. The data sets consisted of 20,000 input
vectors, which we divided evenly into a training set, a validation set to fix the parameter $\lambda$, and a
test set. We used Tabu search as the heuristic solver [Pal04]. For both experiments we employed a
dictionary consisting of decision stumps of the form:

$$ h_l^{1+}(x) = \mathrm{sign}(x_l - \Theta_l^{1+}) \quad \text{for } l = 1, \ldots, M \tag{6} $$

$$ h_l^{1-}(x) = \mathrm{sign}(-x_l - \Theta_l^{1-}) \quad \text{for } l = 1, \ldots, M \tag{7} $$

$$ h_l^{2+}(x) = \mathrm{sign}(x_i x_j - \Theta_{i,j}^{2+}) \quad \text{for } l = 1, \ldots, \binom{M}{2};\; i, j = 1, \ldots, M;\; i < j \tag{8} $$
Figure 2: Test errors for the synthetic data set. We ran the outer loop algorithm for three different
values of Q: Q = 64, Q = 128, and Q = 256. The curves show means for 100 runs and the 
error bars indicate the corresponding standard deviations. All three versions outperform AdaBoost. 
The gain increases as the classification problem gets harder i.e. as the overlap between the positive 
and negative example clouds increases. The Bayes error rate for the case of complete overlap,
overlap coefficient = 1, is 0.05. One can also see that there is a benefit to being able to run larger
optimizations since the error rate decreases with increasing Q. For comparison, we also included the 
results from the last article [NDRM08] for a classifier based on a fixed dictionary using quadratic 
loss (QP 2) for which the training was performed as per Eqn. 2. Not surprisingly, working with an 
adaptive set of weak classifiers yields higher accuracy. 



$$ h_l^{2-}(x) = \mathrm{sign}(-x_i x_j - \Theta_{i,j}^{2-}) \quad \text{for } l = 1, \ldots, \binom{M}{2};\; i, j = 1, \ldots, M;\; i < j \tag{9} $$

Here $h_l^{1+}$, $h_l^{1-}$, $h_l^{2+}$ and $h_l^{2-}$ are positive and negative weak classifiers of orders 1 and 2
respectively; $M$ is the dimensionality of the input vector $x$; $x_l, x_i, x_j$ are the elements of the input vector
and $\Theta_l^{1+}$, $\Theta_l^{1-}$, $\Theta_{i,j}^{2+}$ and $\Theta_{i,j}^{2-}$ are optimal thresholds of the positive and negative weak classifiers of
orders 1 and 2 respectively. For the 30-dimensional input data the dictionary employs 930 weak
classifiers and for the 96-dimensional input it consists of 9312 weak learners.
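The quoted dictionary sizes follow from counting one positive and one negative stump per input dimension and per unordered pair of dimensions, i.e. $2M + 2\binom{M}{2}$. A short Python check (the function name is ours, for illustration only):

```python
from math import comb

def dictionary_size(M):
    """Order-1 stumps (positive and negative) plus order-2 stumps for all pairs i < j."""
    return 2 * M + 2 * comb(M, 2)

print(dictionary_size(30), dictionary_size(96))
```

This reproduces the 930 and 9312 weak classifiers stated for the 30- and 96-dimensional inputs.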

As in [NDRM08] we compute an optimal threshold for the final strong classifier according to
$\Theta = \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} w_i^{opt} h_i(x_s)$. The final classifier then becomes $y = \mathrm{sign}\left( \sum_{i=1}^{N} w_i^{opt} h_i(x) - \Theta \right)$.
In a separate set of experiments we co-optimized $\Theta$ jointly with the weights $w_i$. For the datasets
we studied the difference was negligibly small but we do not expect this to be generally the case.
Note that in order to handle the multi-valued global threshold within the framework of discrete
                   AdaBoost         Outer Loop       Outer Loop       Outer Loop
                                    Q = 64           Q = 128          Q = 256
Test Error         0.258 ± 0.006    0.254 ± 0.010    0.249 ± 0.010    0.246 ± 0.009
Weak Classifiers   257.8 ± 332.1    116.8 ± 139.0    206.1 ± 241.8    356.3 ± 420.3
Reweightings       658.9 ± 209.3    130.8 ± 65.3     145.6 ± 65.5     159.8 ± 63.7
Training Error     0.038 ± 0.054    0.185 ± 0.039    0.170 ± 0.039    0.158 ± 0.040
Outer Loops                         11.9 ± 5.0       12.2 ± 4.6       12.6 ± 4.4

Figure 3: Test results obtained for the natural data set with 30-dimensional input vectors. Similar
to the synthetic data we compare the outer loop algorithm for three different window sizes Q = 64,
Q = 128, and Q = 256 to AdaBoost. The results were obtained for 1000 runs. The piecewise
global optimization based training only leads to slightly lower test errors but obtains those with a
significantly reduced number of weak classifiers. Also, the number of iterations needed to train the
strong classifiers is more than 4 times lower than required by AdaBoost.



optimization one has to insert a binary expansion for $\Theta$ and the loss term then becomes $L(w) =$

Ef=i (Ef=i whit.) - Etf N] e fc 2* + - 1) - vs) 2 - 
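The threshold rule for the final strong classifier described in this section (the mean unthresholded score over the training set) can be sketched in Python as follows. The matrix layout, the function names, and the convention that a score exactly at the threshold is classified as +1 are assumptions made for illustration.

```python
import numpy as np

def global_threshold(H_train, w):
    """Mean over training samples of the unthresholded score sum_i w_i h_i(x_s)."""
    return np.mean(H_train @ w)

def classify(H, w, theta):
    """sign(sum_i w_i h_i(x) - theta); ties are mapped to +1 (assumption)."""
    return np.where(H @ w - theta >= 0, 1, -1)

# Toy example: three samples, two weak classifiers, only the first one selected.
H_demo = np.array([[1, 1], [-1, 1], [-1, -1]])
w_demo = np.array([1, 0])
theta_demo = global_threshold(H_demo, w_demo)
labels_demo = classify(H_demo, w_demo, theta_demo)
```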

Test results for the synthetic data set are shown in Fig. 2 and the table in Fig. 3 displays results
obtained from the natural data set. We did comprehensive tests of the described inner and outer loop
algorithms and found that minor modifications lead to the best results. We found that rather than
adding just $Q - T_{inner}$ weak classifiers, the error rates dropped slightly (about 10%) if we
replaced all $Q$ classifiers from the previous round by new ones. The objective in Eqn. (2) employs a
scaling factor of $\frac{1}{N}$ to ensure that the unthresholded output of the classifier, sometimes referred to
as score function, does not overshoot the $\{-1, +1\}$ labels. Systematic investigation of the scaling
factor, however, suggested that larger scaling factors lead to a minimal improvement in accuracy
and to a more significant reduction in the number of classifiers used (between 10-30%). Thus, to
obtain the reported results we chose a scale factor of j| . 

To determine the optimal size T of the strong classifier generated by AdaBoost we used a vali- 
dation set. If the error did not decrease during 400 iterations we stopped and picked the T for which 
the minimal error was obtained. The results for the 96-dimensional natural data sets looked similar. 

5 Scaling analysis using Quantum Monte Carlo simulation 

We used the Quantum Monte Carlo (QMC) simulator of [FGG+09] to obtain an initial estimation
of the time complexity of the quantum adiabatic algorithm on our objective function. According
to the adiabatic theorem [Mes99], the ground state of the problem Hamiltonian $H_P$ is found with
high probability by the quantum adiabatic algorithm, provided that the evolution time $T$ from the
initial Hamiltonian $H_B$ to $H_P$ is $\Omega(g_{min}^{-2})$, where $g_{min}$ is the minimum gap. Here $H_B$ is chosen as
$H_B = \sum_{i=1}^{N} (1 - \sigma_i^x)/2$. The minimum gap is the smallest energy gap between the ground state $E_0$
and first excited state $E_1$ of the time-dependent Hamiltonian $H(t) = (1 - t/T) H_B + (t/T) H_P$ at
any $0 \le t \le T$. For notational convenience, we also use $H(s) = (1 - s) H_B + s H_P$ with $0 \le s \le 1$.
More details can be found in the seminal work [FGGS00].

As a consequence, to find the time complexity of AQC for a given objective function, one 
needs to estimate the asymptotic scaling of the minimum gap as observed on a collection of typical- 
case instances of this objective function. As noted in [AC09], the task of analytically extracting
the minimum gap scaling has been extremely difficult in practice, except for a few special cases. 
The only alternative is to resort to numerical methods, which consist of diagonalization and QMC 
simulation. Unfortunately, diagonalization is currently limited to about N < 30, and QMC to about 
N = 256, where N is the number of binary variables [YKS09]. Hence, the best that can be done 
with the currently available tools is to collect data via QMC simulations on small problem instances 
and attempt to extrapolate the scaling of the minimum gap for larger instances. 
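For very small instances the diagonalization route mentioned above can be sketched directly. The following Python code builds $H(s) = (1 - s) H_B + s H_P$ as dense matrices and scans the gap between the two lowest eigenvalues over a grid of $s$ values. It is an illustration only: the helper names, the grid resolution and the example energies are assumptions, and the dense representation limits it to a handful of qubits.

```python
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])
I2 = np.eye(2)

def sigma_x(i, n):
    """Pauli-x acting on qubit i of n, built as a Kronecker product with identities."""
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, SX if k == i else I2)
    return out

def initial_hamiltonian(n):
    """H_B = sum_i (1 - sigma_i^x) / 2."""
    return sum((np.eye(2 ** n) - sigma_x(i, n)) / 2 for i in range(n))

def problem_hamiltonian(energies):
    """Diagonal H_P; energies[b] is the objective value of the bit string with index b."""
    return np.diag(np.asarray(energies, dtype=float))

def gap(s, HB, HP):
    vals = np.linalg.eigvalsh((1 - s) * HB + s * HP)  # eigenvalues in ascending order
    return vals[1] - vals[0]

def minimum_gap(HB, HP, grid=201):
    s_values = np.linspace(0.0, 1.0, grid)
    gaps = [gap(s, HB, HP) for s in s_values]
    k = int(np.argmin(gaps))
    return s_values[k], gaps[k]

# Example with hypothetical objective values for 3 binary variables.
HB_demo = initial_hamiltonian(3)
HP_demo = problem_hamiltonian([3.0, 0.5, 2.0, 1.5, 4.0, 2.5, 0.0, 5.0])
s_star, g_min = minimum_gap(HB_demo, HP_demo)
```

In a training setting the diagonal of `HP` would hold the value of the training objective for every assignment of the weights, which is what restricts this approach to small $N$.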

Using the QMC simulator of [FGG+09], we indirectly measured the minimum gap by estimating
the magnitude of the second derivative of the ground state energy with respect to $s$, which is
related to the minimum gap [YKS08]. This quantity is an upper bound on $2 |V_{01}|^2 / g_{min}$, where
$V_{01} = \langle \Psi_0 | dH/ds | \Psi_1 \rangle$ and $\Psi_0$, $\Psi_1$ are the eigenstates corresponding to the ground and first
excited states of $H$. However, the quantity that one is interested in for the time scaling of AQC is
Figure 4: The quantity $\left| \frac{d^2 E_0}{ds^2} s^2 (1 - s)^2 \right|$ determined by quantum Monte Carlo simulation as well as
exact diagonalization for a training problem with 20 weight variables. For small problem instances
with fewer than 30 variables, we can determine this quantity via exact diagonalization of the
Hamiltonian $H(s)$. As one can see, the results obtained by diagonalization coincide very well with the ones
determined by QMC. The training objective is given by Eqn. 2 using the synthetic data set with an
overlap coefficient of 0.95.
Figure 5: A plot of the peaks of the mean of $\left| \frac{d^2 E_0}{ds^2} s^2 (1 - s)^2 \right|$ against the number of qubits for
the range of 10-100 qubits. The error bars indicate the standard deviation. Each point of the plot
represents 20 QMC runs. The data is well fitted by a linear function. From the fact that the scaling
is at most polynomial in the problem size we can infer that the minimum gap and hence the runtime
of AQC are scaling polynomially as well.

$|V_{01}| / g_{min}^2$, but assuming that the matrix element $V_{01}$ is not extremely small, the scaling of the
second derivative, polynomial or exponential, can be used to infer whether the time complexity of
AQC is polynomial or exponential in $N$.
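The relation between the curvature of the ground state energy and the gap can be checked numerically on a toy instance. The Python sketch below estimates the second derivative by a central finite difference and compares its magnitude with $2 |V_{01}|^2 / g$ at the same $s$, where $g$ is the local gap. The two-qubit Hamiltonians, the step size and the function names are assumptions chosen for illustration; since $H(s)$ is linear in $s$, $dH/ds = H_P - H_B$.

```python
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])
I2 = np.eye(2)

# Two-qubit example: H_B = sum_i (1 - sigma_i^x) / 2 and a diagonal H_P with made-up energies.
HB = (np.eye(4) - np.kron(SX, I2)) / 2 + (np.eye(4) - np.kron(I2, SX)) / 2
HP = np.diag([0.0, 1.3, 0.7, 2.1])

def ground_energy(s):
    return np.linalg.eigvalsh((1 - s) * HB + s * HP)[0]

def curvature_and_bound(s, h=1e-3):
    """Return (|d^2 E_0 / ds^2|, 2 |V_01|^2 / g) at the given s."""
    d2 = (ground_energy(s + h) - 2 * ground_energy(s) + ground_energy(s - h)) / h**2
    vals, vecs = np.linalg.eigh((1 - s) * HB + s * HP)
    V01 = vecs[:, 0] @ (HP - HB) @ vecs[:, 1]   # matrix element of dH/ds
    g = vals[1] - vals[0]                        # local gap at this s
    return abs(d2), 2 * V01**2 / g

curv, bound = curvature_and_bound(0.5)
```

The curvature collects contributions from all excited states, which is why it can only exceed the single-level term.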

Fig. 5 shows the results of a scaling analysis for the synthetic data set. The result is encouraging
as the maxima of $\left| \frac{d^2 E_0}{ds^2} s^2 (1 - s)^2 \right|$ only scale linearly with the problem size. This implies that
the runtime of AQC on this data set is likely to scale at most polynomially. It is not possible to
make a statement on how typical this data set and hence this scaling behavior is. We do know from
related experiments with known optimal solutions that Tabu search often fails to obtain the optimal
solution for a training problem for sizes as small as 64 variables.¹ Obviously, scaling will depend
on the input data and the dictionary used. In fact it should be possible to take a hard problem
known to defeat AQC [AC09] and encode it as a training problem, which would cause scaling to
become exponential. But even if the scaling is exponential, the solutions found by AQC for a given
problem size can still be significantly better than those found by classical solvers. Moreover, newer
versions of AQC with changing paths [FGG+09] may be able to solve hard training problems like
this efficiently.

¹ For instance we applied Tabu search to 30-dimensional separable data sets of the form depicted in Fig. 1. Tabu search
failed to return the minimal objective value for N=64 and S=9300.
6 Discussion and future work



Building on earlier work [NDRM08] we continued our exploration of discrete global optimization
as a tool to train binary classifiers. The proposed algorithms, which we would like to call QBoost,
enable us to handle training tasks of larger sizes as they occur in production systems today. QBoost
offers gains over standard AdaBoost in three respects.

1. The generalization error is lower.

2. Classification is faster during execution because it employs a smaller number of weak classi- 
fiers. 

3. Training can be accomplished more quickly since the number of boosting steps is smaller. 

In all experiments we found that the classifier constructed with global optimization was significantly 
more compact. The gain in accuracy however was more varied. The good performance of a large 
scale binary classifier trained using piecewise global optimization in a form amenable to AQC but 
solved with classical heuristics shows that it is possible to map the training of a classifier to AQC 
with negative translation costs. Any improvements to the solution of the training problem brought 
about by AQC will directly increase the advantage of the algorithm proposed here over conventional 
greedy approaches. Access to emerging hardware that realizes the quantum adiabatic algorithm is 
needed to establish the size of the gain over classical solvers. This gain will depend on the structure 
of the learning problem. 

The proposed framework offers attractive avenues for extensions that we will explore in future 
work. 

Alternative loss functions 

We employed the quadratic loss function in training because it maps naturally to the quantum
processors manufactured by D-Wave Systems, which support solving quadratic unconstrained binary
programming. Other loss functions merit investigation as well, including new versions that
traditionally have not been studied by the machine learning community. A first candidate for which we
have already done preliminary investigations is the 0-1 loss since it measures the categorical training
error directly and not a convex relaxation. This loss function is usually discarded due to its
computational intractability, making it an attractive candidate for the application of AQC. In particular 0-1
loss will do well on separable data sets with small Bayes error. An example is the dataset depicted
in Fig. 1 and its higher-dimensional analogs. An objective as in Eqn. 2 employing 0-1 loss and
including $\Theta$ in the optimization has the ideal solution as its minimum, while for AdaBoost as well
as the outer loop algorithm with square loss the error approaches 50% as the dimension of the input
grows larger.

We developed two alternative objective functions that mimic the action of the 0-1 loss in a 
quadratic optimization framework. Unfortunately this is only possible at the expense of auxiliary 
variables. The new objective minimizes the norm of the labels $y_s$ simultaneously with finding the
smallest set of weights $w_i$ that minimizes the training error. Samples that cannot be classified
correctly are flagged by error bits $e_s$.
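$$
\begin{aligned}
(w^{opt}, y^{opt}, e^{opt}) ={} & \arg\min_{w, y, e} \Bigg( \sum_{s=1}^{S} \Bigg( \Big( \sum_{i=1}^{N} w_i h_i(x_s) - \mathrm{sign}(y_s) y_s \Big)^2 \\
& \quad + N^2 \Big( \sum_{i=1}^{N} w_i h_i(x_s) - \mathrm{sign}(y_s) y_s + \mathrm{sign}(y_s) N e_s \Big)^2 \Bigg) + \lambda \sum_{i=1}^{N} w_i \Bigg) \\
={} & \arg\min_{w, y, e} \Bigg( (1 + N^2) \sum_{i,j=1}^{N} w_i w_j \Big( \sum_{s=1}^{S} h_i(x_s) h_j(x_s) \Big) \\
& \quad + (1 + N^2) \sum_{s=1}^{S} \Big( y_1^2 + 2 y_1 \sum_{k=0}^{\lceil \log_2 N \rceil - 1} y_{k,s} 2^k + \sum_{k,k'=0}^{\lceil \log_2 N \rceil - 1} y_{k,s} y_{k',s} 2^{k+k'} \Big) \\
& \quad + N^4 \sum_{s=1}^{S} e_s - 2 (1 + N^2) \sum_{i=1}^{N} \sum_{s=1}^{S} w_i \, \mathrm{sign}(y_s) h_i(x_s) \Big( y_1 + \sum_{k=0}^{\lceil \log_2 N \rceil - 1} y_{k,s} 2^k \Big) \\
& \quad + 2 N^3 \sum_{i=1}^{N} \sum_{s=1}^{S} w_i e_s \, \mathrm{sign}(y_s) h_i(x_s) - 2 N^3 \sum_{s=1}^{S} e_s \Big( y_1 + \sum_{k=0}^{\lceil \log_2 N \rceil - 1} y_{k,s} 2^k \Big) + \lambda \sum_{i=1}^{N} w_i \Bigg)
\end{aligned}
\tag{10}
$$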

with $y_s \in \{1, 2, \ldots, N\}$. To replace the $N$-valued $y_s$ with binary variables we effected a binary
expansion $y_s = y_1 + \sum_{k=0}^{\lceil \log_2 N \rceil - 1} y_{k,s} 2^k$. $y_1$ is a constant we set to 1 for the purpose of preventing
$y_s$ from ever becoming 0. The number of binary variables needed is $N$ for $w$, $S \lceil \log_2 N \rceil$ for $y$ and
$S$ for $e$.
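The algebra behind the expanded form of Eqn. (10) can be checked numerically. The Python sketch below evaluates a compact form and its expansion for random binary assignments, using $e_s^2 = e_s$ and $\mathrm{sign}(y_s)^2 = 1$. It reflects our reading of the objective as the sum of a squared margin term and an $N^2$-weighted squared term with the error bit; the variable names (`labels` for $\mathrm{sign}(y_s)$, `ybits` for the expansion bits $y_{k,s}$) are illustrative.

```python
import numpy as np

def compact_objective(w, e, ybits, H, labels, lam, y1=1):
    """Sum of the two squared terms per sample plus lam * sum(w)."""
    N = H.shape[1]
    mag = y1 + ybits @ (2 ** np.arange(ybits.shape[1]))   # N-valued y_s from its bits
    t = H @ w - labels * mag
    return np.sum(t**2 + N**2 * (t + labels * N * e) ** 2) + lam * w.sum()

def expanded_objective(w, e, ybits, H, labels, lam, y1=1):
    """Term-by-term expansion into at most pairwise products of binary variables."""
    N = H.shape[1]
    lin = ybits @ (2 ** np.arange(ybits.shape[1]))        # sum_k y_{k,s} 2^k
    mag = y1 + lin
    mag_sq = y1**2 + 2 * y1 * lin + lin**2                # lin**2 = sum_{k,k'} y_k y_k' 2^(k+k')
    a = H @ w
    return ((1 + N**2) * (w @ (H.T @ H) @ w)
            + (1 + N**2) * mag_sq.sum()
            + N**4 * e.sum()
            - 2 * (1 + N**2) * np.sum(labels * a * mag)
            + 2 * N**3 * np.sum(e * labels * a)
            - 2 * N**3 * np.sum(e * mag)
            + lam * w.sum())
```

Both functions agree for every binary assignment, which confirms that the expansion contains only linear and pairwise terms in the binary variables.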

The computational hardness of learning objectives based on 0-1 loss manifested itself in that for
handcrafted datasets for which we knew the solution we could see that Tabu search was not able to
find the minimum assignment. We also conducted a QMC analysis but were not able to determine a
finite gap size for problem sizes of 60 variables and larger. However, this was a preliminary analysis
that will have to be redone with larger computational resources. The difficulty of determining the gap
size led us to propose an alternative version that uses a larger number of auxiliary variables but has
a smaller range of coefficients. Samples that can be classified correctly are flagged by indicator bits
$e_s^+$. Vice versa, samples that cannot be classified correctly are indicated by $e_s^-$.

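$$ (w^{opt}, y^{opt}, e^{opt}) = \arg\min_{w, y, e} \left( \sum_{s=1}^{S} \left( \Big( \sum_{i=1}^{N} w_i h_i(x_s) - (e_s^+ - e_s^-) \, \mathrm{sign}(y_s) y_s \Big)^2 + e_s^- \right) + \lambda \sum_{i=1}^{N} w_i \right) \tag{11} $$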

The number of binary variables needed is $N$ for $w$, $S \lceil \log_2 N \rceil$ for $y$ and $2S$ for the $e_s^+$ and $e_s^-$.
However, since the objective contains third-order terms, we will need to effect a variable change to
reduce to second order: $y_s^+ = e_s^+ y_s$ and $y_s^- = e_s^- y_s$. This adds another $2S \lceil \log_2 N \rceil$ qubits. Due to
the large number of binary variables we have not analyzed Eqn. 11 yet.
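The variable counts stated for the two 0-1 loss formulations can be tallied with a few lines of Python. The function names and the example sizes are chosen for illustration only.

```python
import math

def qubits_eq10(N, S):
    """N weights, S * ceil(log2 N) label bits, S error bits."""
    return N + S * math.ceil(math.log2(N)) + S

def qubits_eq11(N, S):
    """As above with 2S indicator bits, plus 2S * ceil(log2 N) bits from the variable change."""
    K = math.ceil(math.log2(N))
    return N + S * K + 2 * S + 2 * S * K

print(qubits_eq10(64, 100), qubits_eq11(64, 100))
```

Already for 64 weak classifiers and 100 training samples the second formulation needs more than twice as many binary variables as the first, which illustrates why it has not been analyzed yet.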
Co-Optimization of weak classifier parameters

The weak classifiers depend on parameters such as the thresholds $\Theta_l$. Rather than determining
these parameters in a process outside of the global optimization, it would be more in keeping with the spirit
of our approach to include these in the global optimization as well. Thus it would look more like
a perceptron but one in which all weights are determined by global optimization. So far we have
not been able to find a formulation that only uses quadratic interactions between the variables and
that does not need a tremendous amount of auxiliary variables. This is due to the fact that the weak
classifier parameters live under the sign function, which makes the resulting optimization problem
contain terms of order $N$ if no simplifications are effected. Our desire to stay with quadratic
optimization stems from the fact that the current D-Wave processors are designed to support this format,
and that it will be hard to represent $N$-local interactions in any physical process.

Co-Training of multiple classifiers 

Our training framework allows for simultaneous training of multiple classifiers with feature sharing 
in a very elegant way. For example if two classifiers are to learn similar tasks, then a training objec- 
tive is formed that sums up two objectives as described in Eqn. (2) — one for each classifier. Then 
cross terms are introduced that encourage the reuse of weak classifiers and which have the form 
$\mu \sum_{i=1}^{N} (w_i^A - w_i^B)^2$. The $w_i^A$ and $w_i^B$ are the weights of classifiers A and B respectively. From the
perspective of classifier A this looks like a special form of context dependent regularization. The 
resulting set of classifiers is likely to exhibit higher accuracy and reduced execution times. But more 
importantly, this framework may allow reducing the number of necessary training examples. 
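A coupling of this kind stays within the quadratic binary format, because for binary weights $(w_i^A - w_i^B)^2 = w_i^A + w_i^B - 2 w_i^A w_i^B$. The Python sketch below combines two given quadratic objectives into one joint problem over $2N$ variables. The coupling strength is called `mu` here as an assumed name, and the function `joint_qubo` is hypothetical.

```python
import numpy as np

def joint_qubo(quad_A, lin_A, quad_B, lin_B, mu):
    """Joint objective over z = (w_A, w_B): E_A + E_B + mu * sum_i (w_A_i - w_B_i)^2."""
    N = len(lin_A)
    quad = np.zeros((2 * N, 2 * N))
    quad[:N, :N] = quad_A
    quad[N:, N:] = quad_B
    idx = np.arange(N)
    quad[idx, N + idx] -= mu      # the -2 * mu * w_A_i * w_B_i term, split symmetrically
    quad[N + idx, idx] -= mu
    lin = np.concatenate([lin_A, lin_B]) + mu   # mu * (w_A_i + w_B_i)
    return quad, lin
```

The cross terms lower the energy whenever both classifiers select the same weak classifier, which is the feature sharing effect described above.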

Incorporating Gestalt principles 

The approach also allows us to seamlessly incorporate a priori knowledge about the structure of the
classification problem, for instance in the form of Gestalt principles. For example, if the goal is 
to train an object detector, it may be meaningful to impose a constraint that if a feature is detected 
at position x in an image then there should also be one at a feature position x' nearby. Similarly, 
we may be able to express symmetry or continuity constraints by introducing appropriate penalty 
functions on the weight variables optimized during training. Formally, Gestalt principles take on 
the form of another regularization term, i.e. a penalty term G(w) that is a function of the weights. 